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Abstract. In this paper we review the parallel solution of sparse linear systems, usually deriving 
by the discretization of ODE-IVPs or ODE-BVPs. The approach is based on the concept of parallel 
factorization of a (block) tridiagonal matrix. This allows to obtain efficient parallel extensions of 
many known matrix factorizations, and to derive, as a by-product, a unifying approach to the parallel 
solution of ODEs. 
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1. Introduction. The numerical solution of ODEs requires the solution of sparse 
and structured linear systems. The parallel solution of these problems may be ob- 
tained in two ways: for BVPs, since the size of the associated linear system is large, 
we need to develop parallel solvers for the obtained linear systems; for IVPs we need 
to define appropriate numerical methods that allow to obtain parallelizable linear 
systems. 

In both cases, the main problem can then be taken back to the solution of special 
sparse linear systems, whose solution is here approached through the use of parallel 
factorizations, originally introduced for deriving efficient parallel tridiagonal solvers [21 
[9], and subsequently generalized to block tridiagonal, Almost Block Diagonal (ABD), 
and Bordered Almost Block Diagonal (BABD) systems Q32 QU US EEE1 ED 01] • 

With this premise, the structure of the paper is the following: in Section [2] the 
main facts about parallel factorizations and their extensions are briefly recalled; then, 
in Section[3jtheir application for solving ODE problems is sketched; finally, in Section[4] 
we show that this approach also encompasses the so called "Parareal" algorithm, 
recently introduced in [231 121] • 

2. Parallel factorizations. In this section we consider several parallel algo- 
rithms in the class of partition methods for the solution of linear systems, 

Ax = /, (2.1) 

where A is a n x n sparse and structured matrix, and x and / are vectors of length n. 
We will investigate the parallel solution of (|2.ip on p processors, supposing p<nin 
order for the number of sequential operations to be much smaller than that of parallel 
ones. 

The coefficient matrices A here considered are (block) banded, tridiagonal, bidiag- 
onal, or even Almost Block Diagonal (ABD). All these structures may be rearranged 



*Work developed within the project "Numerical methods and software for differential equations". 

tDipartimento di Matematica, Universita di Bari, Bari, Italy (amodioadm.uniba.it). 

* Dipartimcnto di Matematica, Universita di Firenze, Firenze, Italy (luigi.brugnanoSunifi.it). 



1 



2 



P. AMODIO AND L. BRUGNANO 



Fig. 2.1. Partitioning of a banded matrix. Each point represents a (block) entry of the matrix. 



in the form 
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where the diagonal blocks are square and the superscript (i) indicates that this block 
is handled only by processor i. The size of the blocks cft\ A«, and is in 
general independent of both i and j, and only depends on the sparsity structure of the 
coefficient matrix A. In particular, the size of the blocks is quite important, since 
the sequential section of the algorithm is proportional to it. Therefore, the blocks 
a.W should be as small as possible. As an example, if A is (block) tridiagonal, 
reduces to a single (block) entry. Vice versa, in case of banded (block) matrices, the 
(block) size of aW equals to max(s, r), where s and r denote the number of lower and 
upper off (block) diagonals (see Figure [2TT|) . respectively. In case of ABD matrices, 
is a block of size equal to the number of rows in each block row of the coefficient 



matrix (see Figure [^2]) . Since row and column permutations inside each block do 
not destroy the sparsity structure of the coefficient matrix, in ABD matrices we may 
permute the elements inside to improve stability properties. Blocks AW have the 
same sparsity structure as the original matrix, and are locally handled by using any 
suitable sequential algorithm. 

In order to keep track of any parallel algorithm, we consider the following factor- 
ization [2j [9] 



A = FTG, 



(2.3) 
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Fig. 2.2. Partitioning of an ABD matrix. Each point represents an entry of the matrix. 
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I, I and o are identity and null matrices of appropriate sizes, and jV^SW is any 
suitable factorization of the block A^ l \ The remaining entries of F, T, and G can be 
derived from (|2.3[) by direct identification. 

Factorization (|2.3p may be computed in parallel on the p processors. For sim- 
plicity, we analyze the factorization of the sub-matrix identified by the superscript (?) 
(with obvious differences for i = 1 and i = p) 
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and the other entries are the same as defined in (|2.4|) - (|2.6|) . 

Consequently, by considering any given factorization for it is possible to 

derive corresponding parallel extensions of such factorizations, which cover most of 
the parallel algorithms in the class of partition methods. In particular, the following 
ones easily derive for matrices with well conditioned sub-blocks A" (this means, for 
example, that pivoting is unnecessary or does not destroy the sparsity structure): 

• LU factorization, by setting in (2l| = and = where 
l,(vjjw is the LU factorization of the matrix A". In this case, the (block) 
vectors j/W and maintain the same sparsity structure as that of and 
of 1 , respectively, while the vectors and are non-null fill-in (block) 
vectors, obtained by solving two triangular systems. 

• LU D factorization (which derives from the Gauss- Jordan elimination algo- 
rithm), by setting in (|2.8[) — a diagonal matrix, and 
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where and are lower and upper triangular matrices, respectively, with 
unit diagonal. Therefore, and maintain the same sparsity structure 
as that of b± and Cq , respectively, while «W and yW are non-null fill-in 
(block) vectors. 

• cyclic reduction algorithm [2j [9] (see also [IJ [TH [15l [16]), which is one of 
the best known parallel algorithms but that, in its original form, requires a 
synchronization at each step of reduction. In fact, the idea of this algorithm is 
to perform several reductions that, at each step, halve the size of the system. 
On the other hand, to obtain a factorization in the form (|2.8[) . it is possible 
to consider cyclic reduction as a sequential algorithm to be applied locally, 

M« = {P^L^P^Lf ■ ■ ■ • ■ ^pW'&fpf), 

where P£ are suitable permutation matrices that maintain the first and last 
row in the reduced matrix. The computational cost, which is much higher if 
the algorithm is applied to A on a sequential computer, becomes comparable 
to the previous local factorizations since this algorithm does not compute fill- 
in block vectors. As a consequence, the corresponding parallel factorization 
algorithm turns out to be one of the most effective. 

• Alternate row and column elimination |25| which is an algorithm suitable for 
ABD matrices. In fact, for such matrices alternate row and column permu- 
tations always guarantee stability without fill-in. This feature extends to the 
parallel algorithm, by taking into account that row permutations between the 
first block row of A® and the block containing (see (|2.7|l ). still make the 
parallel algorithm stable without introducing fill-in. Such parallel factoriza- 
tion is defined by setting = p(<)£(<) and S« = [/WqM, where P« and 
QW are permutation matrices and and after a suitable reordering 
of the rows and of the columns, are 2x2 block triangular matrices (see [TU] 
for full details). Finally, the (block) vectors yW and «W maintain the same 
sparsity structure as that of b± and \ respectively, whereas and «W 
are fill-in (block) vectors. 

For what concerns the solution of the systems associated to the previous parallel 
factorizations, there is much parallelism inside. The solution of the systems with the 
matrices F and G may proceed in parallel on the different processors. Conversely, the 
solution of the system with the matrix T requires a sequential part, consisting in the 
solution of a reduced system with the (block) tridiagonal reduced matrix 

( a« 7 ( 2 ) \ 
0(2) tt (2) ".. 

7 (p- 1 ) 



(2.9) 



We observe that the (block) size of T p only depends on p and is independent of n. 

For a matrix A with singular or ill-conditioned sub-blocks A^ l \ the local factor- 
izations may be unstable or even undefined. Consequently, it is necessary to slightly 
modify the factorization (|2.8p . in order to obtain stable parallel algorithms. The ba- 
sic idea is that factorization (|2.8|) may produce more than two entries in the reduced 
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system. In other words, the factorization of A^ is stopped when the considered 
sub-block is ill-conditioned (or the local factorization with a singular factor). As a 
consequence, the size of the reduced system is increased as sketched below. Let then 
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when the sub-block A{' of A®, 

«-($*)(*'*)• 

is singular, because the block is singular (i.e., a 2 = 0, in the scalar case). Then, 

(i) . ... A (i) 

a 2 is introduced in the reduced system. By iterating this procedure on D\ , we 
obtain again the factorization (|2.3p . with the only difference that now the reduced 
matrix in T p may be of (block) size larger than p — 1 (compare with (|2.9[1 ). However, 
it may be shown that it still depends only on p, whereas it is independent of n [3J [3] . 
Consequently, the scalar section of the whole algorithm is still negligible, when n 3> p. 
The parallel algorithms that fall in this class are [3l [4] : 

• LU factorization with partial pivoting, defined by setting N± = (P^) T 
and — where P± is a permutation matrix such that L^U^ is 

the LU factorization of P^A^. The remaining (block) vectors are defined 
similarly as in the case of the LU factorization previously described. 

• QR factorization, defined by setting = and S± — R[ . In this case 
both Wi and are fill-in (block) vectors while vf 1 and y± maintain the 
same sparsity structure as that of the corresponding (block) vectors in AfW . 

Factorization l|2.3[) - (|2.6p . and the corresponding parallel algorithms mentioned 
above, are easily generalized to matrices with additional non-null elements in the 
right-lower and/or left-upper corners. This is the case, for example, of Bordered ABD 
(BABD) matrices (see Figure I2.3[) and matrices with a circulant-like structure (see 
Figure 12. 4[) . Supposing the non-null elements are located in the right- upper corner 
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Fig. 2.3. Partitioning of a BABD matrix. Each point represents an entry of the matrix. 



(this is always possible by means of suitable permutation), then the coefficient matrix 
is partitioned in the form 
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where & is the smallest rectangular block containing all the corner elements. 

A factorization similar to that in (|2.3[) - (|2.6[) (the obvious differences are related 
to the first and last (block) rows) produces a corresponding reduced system with the 
reduced matrix 



( a(°) 
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We observe that, for the very important classes of BABD and circulant-like ma- 
trices (the latter, after a suitable row permutation, see Figure |2~5|) . both the matrix 
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Fig. 2.4. Partitioning of a matrix with a circulant-like structure. Each point represents a 
(block) entry of the matrix. 



p.lOj) and the reduced matrix (|2.11|) have the form of a lower block bidiagonal matrix 
(i.e., c^p = and 7W — for all i and j) with an additional right-upper corner block: 
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We have also to note that, for this kind of matrices, the overall computational 
cost of a parallel factorization algorithm has a very small increase. On a sequential 
machine, supposing to maintain the same partitioning of the matrix on p > 1 proces- 
sors, we have a computational cost which is similar to that of any efficient sequential 
algorithm (the corner block b implies the construction of a fill-in (block) vector) but 
with better stability properties (see [Hj). For this reason, a method that is widely 
and efficiently applied to matrices in the form (|2.10p . also on a sequential computer, is 
cyclic reduction (see [TTJ [T71 [TH] where cyclic reduction is applied to B ABD matrices) . 
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Fig. 2.5. Partitioning of the matrix in Fig. \2.4\ after row permutation. Each point represents a 
(block) entry of the matrix. 

3. Parallel solution of differential equations. The numerical solution of 
ODE-BVPs leads to the solution of large and sparse linear systems of equations that 
represent the most expensive part of a BVP code. The sparsity structure of the 
obtained problem depends on the methods implemented. In general, one-step methods 
lead to ABD or BABD matrices, depending on the boundary conditions (separated 
or not, respectively), while multistep methods lead to block banded systems (with 
additional corner blocks in case of non-separated boundary conditions) 5, 6[[22]. The 
parallel algorithms previously described perfectly cope with this kind of systems. For 
this reason we do not investigate further on ODE-BVPs. 

Conversely, we shall now consider the application of parallel factorizations for 
deriving parallel algorithms for numerically solving ODE-IVPs, which we assume, for 
sake of simplicity, to be linear and in the form 

y' = Ly + g(t), te[t ,T], y(t )=y a eR m , (3.1) 
which is, however, sufficient to grasp the main features of the approach [5l [6l 171 [8l [20l 

EUE2]. 

Let us consider a suitable coarse mesh, defined by the following partition of the 
integration interval in (13. 

t = r < n < • • • < t p = T. (3.2) 

Suppose, for simplicity, that inside each sub-interval we apply a given method 
with a constant stepsize 



, n—Ti-i . 

hi = — , i = l,...,p, 



(3.3) 
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to approximate the problem 

y' = Ly + g{t), t€[n-i,Ti], y{n-i)=yoi, i=l,...,p. (3.4) 
If y(t) denotes the solution of problem (|3.1[) . and we denote by 

y n i~y(n-i+nhi), n = 0,...,N, i=l,...,p, (3.5) 

the entries of the discrete approximation, then, in order for the numerical solutions 
of (|3.ip and (|3.4|) to be equivalent, we require that (see lj3.2[) and (|3.5|0 

yox = yo, Voi=VN,i-i, i = 2,...,p. (3.6) 
For convention, we also set 

2/oi = 2/jvo- (3.7) 

Let now suppose that the numerical approximations to the solutions of (|3.4p are 
obtained by solving discrete problems in the form 



M i y i = Viy i 



yii, 



VNi 
mNX7 



1, 



(3.8) 



where the matrices M t G M. mNxmN and v t e K mAr x™, and the vector g . e R mW ^ do 
depend on the chosen method (see, e.g., [5, 6 , for the case of block BVMs) and on the 
problems (|3.4|) . Clearly, this is a quite general framework, which encompasses most 
of the currently available methods for solving ODE-IVPs. By taking into account all 
the above facts, one obtains that the global approximation to the solution of (|3.ip 
is obtained by solving a discrete problem in the form (hereafter, I r will denote the 
identity matrix of dimension r): 
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(3.9) 



Obviously, this problem may be solved in a sequential fashion, by means of the 
iteration (see ([3~5|) -([3~7 | ): 



2/jvo = 2/o, M i y i =g i +v i yN,i-i, i = l,...,p. 

Nevertheless, by following arguments similar to those in the previous section, we 
consider the factorization: 
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where (see p.gp ) 



W l = [0\ Wi] € R mNxmN , Wi = Mr^i e R mArxra . 
Consequently, at first we solve, in parallel, the systems 

MiZi = g % , Zi = ( zu, . . . , z Ni ) T , i = 1, . . . ,p, 
and, then, (see (|3.10[) and (|3.6p ) recursively update the local solutions, 



(3.10) 



(3.11) 



y 1 = zi + tuiyoi, 

y i = Zi + W l y l _ 1 =Zi+ w.yoi, i = 2,...,p. 



(3.12) 



The latter recursion, however, has still much parallelism. Indeed, if we consider 
the partitionings (see (|3~g|) , (|3"XT]) . and (|3~TU)) ) 
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then (|3.12p is equivalent to solve, at first, the reduced system 
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i.e.. 



2/oi=yo, 2/o,i+i = z m + w Ni yoi, i = l,...,p-l, 
after which performing the p parallel updates 



(3.14) 



(3.15) 



y^Zt+Wiyoi, i = l,...,p-l, y p = z p + w p y p. (3.16) 
We observe that: 
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• the parallel solution of the p systems in p. Ill) is equivalent to compute the 
approximate solution of the following p ODE-IVPs, 

z' = Lz + g(t), te[ri-i,Ti], z(rj_i)=0, i=l,...,p, (3.17) 

in place of the corresponding ones in (|3.4[) ; 

• the solution of the reduced system (|3.14p - (13.15|) consists in computing the 
proper initial values {j/oi} for the previous ODE-IVPs; 

• the parallel updates (|3.16p update the approximate solutions of the ODE- 
IVPs (|3~T7)1 to those of the corresponding ODE-IVPs in lpT4")l . 

Remark 1. Clearly, the solution of the first (parallel) system in (|3 . 11 [) and the 
first (parallel) update in V3.12) (see also (|3.16p ) can be executed together, by solving 
the linear system (see (|3.6|) ) 

M 1 y 1 =g 1 +v 1 y , (3.18) 

thus directly providing the final discrete approximation on the first processor; indeed, 
this is possible, since the initial condition yo is given. 

We end this section by emphasizing that one obtains an almost perfect parallel 
speed-up, if p processors are used, provided that the cost for the solution of the reduced 
system (|3. 14[) and of the parallel updates (|3. 16[) is small, with respect to that of (|3. 11[) 
(see [5J El for more details). This is, indeed, the case when the parameter N in (|3.3|) 
is large enough and the coarse partition (|3.2[) can be supposed to be a priori given. 

4. Connections with the "Parareal" algorithm. We now briefly describe 
the "Parareal" algorithm introduced in [23 , 24 , showing the existing connections with 
the parallel method previously described. This method, originally defined for solving 
PDE problems, for example linear or quasi-linear parabolic problems, can be directly 
cast into the ODE setting via the semi-discretization of the space variables; that is, 
by using the method of lines. In more detail, let us consider the problem 

^y = £y, *e[to,T], y(t ) = y , (4.1) 

where C is an operator from a Hilbert space V into V . Let us consider again the 
partition (|3.2p of the time interval, and consider the problems 

d 

-7^y = £y, t£[n-x,Ti\, y{n-x) = VQh i = i,...,p. (4.2) 

Clearly, in order for (|4.ip and (|4.2|) to be equivalent, one must require that 

yoi=y(n-%), i = l,...,p. (4.3) 

The initial data (I4.3[) are then formally related by means of suitable propagators T% 
such that 



yo,i+i = Fiyoi, i=l,...,p-l. 



(4.4) 
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The previous relations can be cast in matrix form as (X now denotes the identity 
operator) 
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For solving (|4.5p . the authors essentially define the splitting 



F=(F-G) + G, G 
with coarse propagators 
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and consider the iterative procedure 
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with an obvious meaning of the upper index. This is equivalent to solve the problems 
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(4.6) 



thus providing good parallel features, if we can assume that the coarse operators Gi 
are "cheap" enough. The iteration (|4.6j) defines the "Parareal" algorithm, which is 
iterated until 



^1 



,P, 



are suitably small. In the practice, in case of linear operators, problem (|4.ip becomes, 
via the method of lines, an ODE in the form (|3.ip . with L a huge and very sparse 
matrix. Consequently, problems (|4.2ll become in the form (|3.4[1 . Similarly, the prop- 
agator Ti consists in the application of a suitable discrete method for approximating 
the solution of the corresponding ith problem in (|3.4[) , and the coarse propagator Gi 
describes the application of a much cheaper method for solving the same problem. 
As a consequence, if the discrete problems corresponding to the propagators {J-i} are 
in the form (|3.8|) . then the discrete version of the recurrence (|4.4j) becomes exactly 
(|3.15j) . as well as the discrete counterpart of the matrix form (|4.5p becomes (|3.14p . 



We can then conclude that the "Parareal" algorithm in [23J [24] exactly coincides 
with the iterative solution of the reduced system (|3.14p . induced by a suitable splitting. 
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We observe that the previous iterative procedure may be very appropriate, when 
the matrix L is large and sparse since, in this case, the computations of the block 
vectors {wi} in (|3.10|) , and then of the matrices {wNi} (see (|3.13p ) would be clearly 
impractical. Moreover, it can be considerably improved by observing that 

WNiVoi ~ e {Ti ~ Ti - l)L y 0i . 

Consequently, by considering a suitable approximation to the matrix exponential, the 
corresponding parallel algorithm turns out to become semi-iterative and potentially 
very effective, as recently shown in [S]. 
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